Artificial neural networks (NNs) are an important class of models for learning features in voice conversion (VC) tasks. NNs are typically very effective at processing nonlinear features such as mel-cepstral coefficients (MCCs), which represent the spectral features. However, a simple representation of the fundamental frequency (F0) is not sufficient for NNs to handle emotional VC, because the F0 time sequence of an emotional voice changes drastically. In our previous method, we therefore used the continuous wavelet transform (CWT) to decompose F0 into 30 discrete scales, each separated by one third of an octave, on which NNs can be trained for prosody modeling in emotional VC. In this study, we propose the arbitrary scales CWT (AS-CWT) method to systematically capture F0 features at different temporal scales, which can represent prosodic levels ranging from micro-prosody to the sentence level. The proposed method also uses deep belief networks (DBNs) to pre-train the NNs that then convert the spectral features. With these approaches, the proposed method can convert the spectrum and the F0 of an emotional voice simultaneously, and it outperforms other state-of-the-art methods in emotional VC.
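As an illustration of the fixed-scale decomposition described above, the following is a minimal sketch of a CWT applied to a log-F0 contour at 30 scales spaced one third of an octave apart. It is not the authors' implementation. The Mexican hat mother wavelet, the base scale of one frame, the z-score normalization of log-F0, and the synthetic contour are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet; an assumed choice of wavelet."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-0.5 * t * t)


def third_octave_scales(n_scales=30, base_scale=1.0):
    """Scales s_i = base_scale * 2^(i/3), so three steps span one octave.

    base_scale (in frames) is a hypothetical parameter.
    """
    return [base_scale * 2.0 ** (i / 3.0) for i in range(n_scales)]


def cwt(signal, scales):
    """Direct-sum CWT: W[i][b] = s_i^(-1/2) * sum_t x[t] * psi((t - b) / s_i)."""
    n = len(signal)
    coeffs = []
    for s in scales:
        row = []
        for b in range(n):
            acc = 0.0
            for t in range(n):
                acc += signal[t] * mexican_hat((t - b) / s)
            row.append(acc / math.sqrt(s))
        coeffs.append(row)
    return coeffs


def zscore(values):
    """Normalize to zero mean and unit variance (assumed preprocessing)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]


# Synthetic log-F0 contour (assumed data): a slow phrase-level movement
# plus a faster, smaller fluctuation around a 120 Hz baseline.
n_frames = 64
log_f0 = [
    math.log(120.0)
    + 0.20 * math.sin(2.0 * math.pi * t / 50.0)
    + 0.05 * math.sin(2.0 * math.pi * t / 5.0)
    for t in range(n_frames)
]

scales = third_octave_scales()       # 30 scales, one third of an octave apart
coeffs = cwt(zscore(log_f0), scales)  # 30 rows, one per scale, n_frames columns
```

Each row of `coeffs` holds the F0 variation at one temporal scale, with small scales responding to fast fluctuations and large scales to slow ones. The `scales` argument accepts any list of positive values; how AS-CWT selects its scales is not specified in this abstract, so the sketch covers only the fixed one-third-octave case.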